ABSTRACT: Social media surveillance is 
a requirement for governments and 
intelligence agencies around the world 
to detect and prevent hate crimes. The 
dynamic and unstructured nature of the 
textual content available on social 
media platforms makes it very complex 
to extract hate related speech patterns 
from this content. It also creates 
ambiguities in the data and therefore, 
data mining techniques become difficult 
to apply in this scenario. Several 
alternative techniques were adopted by 
different researchers in the past to cope 
with this problem and to capture and 
analyze such unstructured text for the 
purpose of hate speech detection. In this
paper, we review and categorize the
state of the art of these techniques,
dividing them into three categories,
namely text mining, sentiment analysis,
and semantics. We also discuss the
challenges in applying the existing
techniques, which can be taken up as
future directions.


INDEX TERMS: data mining, hate speech,
NLP


I. INTRODUCTION 


Today, social media platforms have 
made it possible for people around the 
world to connect and communicate with 
each other, easily. However, it has also 
made it easy for criminals to perform 
criminal activities using these 
platforms. Governments and 
intelligence agencies need to detect and 
prevent such criminal activities using 
data mining and other similar 
techniques on the data collected from 
these sources. However, there has been
an extensive debate about the amount of
private information that governments
and agencies may gather for
surveillance [1].


Social media platforms such as 
Facebook, Twitter and LinkedIn 
provide extensive opportunities for the 
masses to mutually connect and engage 
in order to share knowledge and to 
communicate with each other. Textual 
data communicated over social media 
platforms in the form of comments, 
messages, and posts remains mostly
unstructured because people are more
likely to use language without proper
spelling and grammar. This creates
ambiguity in words and sentences and,
consequently, semantic and syntactic
problems when interpreting the data
and uncovering logical patterns from it
[2].


The problems concerning the
ambiguities in the data can be addressed
using textual mining, or simply text
mining. Conventional data mining
struggles to uncover patterns in
unstructured data because it relies on
searching and analyzing structured data
stored in a database to find the patterns
of interest. Text mining, on the other
hand, is a blend of techniques including
natural language processing (NLP), text
analysis, and information retrieval, and
it is used to find logical patterns in
unstructured data [3]. Text mining is
more complex than data mining because
of the nature of natural language.
Moreover, social media is replete with 
unstructured textual content in its posts 
and comments. 


Several approaches to text mining 
are found in the literature. However, 
there still exist some major challenges 
related to text mining in social media 
including the fact that the data is much 
larger and dynamic in nature. Moreover, 
there are privacy concerns. The use
of semantic approaches for coping with
such problems in social media is also a
recent trend. Such approaches assign
meaning to terms and concepts in order
to find relationships among them that
uncover patterns of interest. Ontologies
are explicit specifications of a shared
conceptualization of real-world
scenarios [4]. They provide
machine-processable semantics and are
therefore used in different domains,
such as software engineering and
medical science, to automate the
tasks involved. Ontologies are also used
in text mining for social media analysis
[5]. Several approaches are used to 
employ ontologies to infer racial, 
religious or political hate speech using 
the content shared on social media 
platforms in the form of posts and 
comments. The rest of this paper
presents a literature review of previous
techniques, divides them into the
domains of text mining, sentiment
analysis, and semantics, and summarizes
the state of the art of these techniques.
It closes with the conclusion and future
directions.


II. LITERATURE REVIEW 


Social networking platforms such as 
Facebook and Twitter are used to 
remove barriers among communities 
around the world and to share and 
communicate knowledge and thoughts 
[6]. Unfortunately, these platforms have
also enabled insults, cyberbullying, and
various forms of incitement through
racial, religious, and political hate
speech that lead to social anarchy [7].


Hate speech as a term was used by
Werner [8] and is defined as an
intentional act of insult using abusive or
hostile messages. Several other terms,
including cyberbullying, are used in the
literature for the same phenomenon.
Hate speech has negative consequences
for society and is punishable by
governments around the world under
different laws [9], [10]. It is thus
necessary for
any government to detect and respond to 
this crime through surveillance. Here, 
we are concerned with surveillance 
involving textual data from various 
social media websites. 


Several techniques are used for 
analyzing unstructured content on 
social media. Social media platforms 
are loaded with huge amounts of textual 
data. The communication of the 
majority of social media users takes the 
form of unstructured data, not using the 
exact words and proper sentence 
structure. Although data mining
techniques have been used to analyze
huge amounts of structured data, these
techniques do not work well with
unstructured and dynamic textual data.
Therefore, text mining and natural 
language processing techniques have 
become popular for analyzing 
unstructured textual data. Text mining
is an extension of data mining, but it is
more complex because it deals with
natural language. It derives meaningful
patterns from unstructured data [2].


Text mining, or text analytics, can be
divided into four stages. First, the data
is captured from social media platforms.
Second, it is pre-processed using
stemming, tokenization, and stop-word
removal. Third, the extracted and
cleaned data is represented through
models. Finally, knowledge is derived
from these representations. For the
representation and modeling of the
data, the BOW (Bag of Words) technique
is mostly employed, as described in
[11]. It transforms the data collected
from these platforms into numeric
vector-based representations to which
algebraic operations can be applied.
However, this technique falls short when
it is used to analyze the relationships
between isolated pieces of information.
This semantic gap can only be filled by
using a semantic knowledge base.


Another text mining approach [12]
extracts intellectual data from
Facebook. It uses the Facebook API to
extract several attributes of users,
including their age, profile, comments,
and timeline posts, and transforms them
into different representations using
data mining techniques. The aim is to
study the users and their activities
and to create individual profiles. This
approach is also concerned with
inter-user comparison of activities and
with behavior prediction.
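

The pre-processing and Bag of Words
stages described above can be sketched
in a few lines of Python. This is only
a minimal illustration and not the
method of [11]: the stop-word list, the
crude suffix-stripping stemmer, and all
function names are assumptions made
for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); real pipelines use larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Crude suffix stripping that stands in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def bag_of_words(posts):
    """Return a sorted vocabulary and one term-count vector per post."""
    docs = [preprocess(p) for p in posts]
    vocabulary = sorted({t for d in docs for t in d})
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors


def dot(u, v):
    """Dot product, one of the algebraic operations the vectors allow."""
    return sum(a * b for a, b in zip(u, v))


posts = ["The users are posting insults", "Insults and more insults posted"]
vocabulary, vectors = bag_of_words(posts)
similarity = dot(vectors[0], vectors[1])
```

Because each post becomes a plain count
vector, word order and the relations
between words are lost, which is the
semantic gap noted above.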


Yet another technique [13] uses
Twitter tweets to identify and monitor
posts related to medication abuse. It
identifies and organizes the tweets that
mention abuse-prone medications and
then divides them into three categories
of drug or medication abuse. The
authors also developed an automated
supervised classification technique
that distinguishes posts that signal
drug abuse from those that do not. NLP
and the WordNet semantic ontology are
also employed to analyze the posts. For
the automated supervised
classification, several available
algorithms, such as Support Vector
Machines (SVM), are used. The
classifiers built from these algorithms
are then combined using stacking, so
that the final decision is based on their
individual predictions.
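

The idea of stacking can be illustrated
with a toy example. The sketch below is
not the classifier of [13]: the keyword
rules stand in for trained base learners
such as SVM, the perceptron meta-learner
and the sample posts are assumptions,
and a real system would train the
meta-learner on held-out predictions.

```python
def keyword_classifier(keywords):
    """Build a base classifier that outputs 1 if any keyword occurs in a post."""
    def classify(post):
        words = set(post.lower().split())
        return 1 if words & keywords else 0
    return classify


# Hypothetical base classifiers standing in for SVM and other learners.
BASE_CLASSIFIERS = [
    keyword_classifier({"abuse", "abusing"}),
    keyword_classifier({"high", "snort"}),
    keyword_classifier({"prescription", "doctor"}),
]


def base_predictions(post):
    """Collect the prediction of every base classifier as a feature vector."""
    return [clf(post) for clf in BASE_CLASSIFIERS]


def train_meta_perceptron(posts, labels, epochs=20):
    """Train a perceptron on the base predictions (the stacking step)."""
    weights = [0.0] * len(BASE_CLASSIFIERS)
    bias = 0.0
    for _ in range(epochs):
        for post, label in zip(posts, labels):
            features = base_predictions(post)
            activation = bias + sum(w * x for w, x in zip(weights, features))
            error = label - (1 if activation > 0 else 0)
            bias += error
            weights = [w + error * x for w, x in zip(weights, features)]
    return weights, bias


def stacked_predict(post, weights, bias):
    """Final decision: the meta-learner applied to the base predictions."""
    features = base_predictions(post)
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


train_posts = [
    "abusing pills to get high",
    "doctor gave me a prescription",
    "want to snort this",
    "my doctor warned about abuse of this prescription",
    "nice weather today",
]
train_labels = [1, 0, 1, 0, 0]
weights, bias = train_meta_perceptron(train_posts, train_labels)
```

The meta-learner learns how much to
trust each base classifier. Here it
learns that a medical context outweighs
a bare mention of abuse.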


A semantic approach by Mika [14]
discusses folksonomy as a dynamic type
of ontology related to a particular
community-based domain (social
network). Folksonomy allows individual
users to express shared concepts using
their own choice of keywords. The
author argues that the very basic and
simple concept of ontology restricts
folksonomy from being used across the
domain, and that problems may arise
with the temporal extension of
knowledge and with the evolution of the
social community as its members or
their commitments change. This can
invalidate the knowledge contained in
that particular ontology. The author
presented a tripartite model consisting
of the actor, concept, and instance for
the purpose of folksonomy. This idea
was inspired by the social tagging
mechanism, in which users apply
different tags to communicate their
concerns. In this way, the construction
of such an ontology focuses on a
particular social media context. For
instance, Flickr is used to share photos,
CiteULike provides tagging of
scientific papers, and so on. Although
their scope is limited to a single
website, the idea of tagging can be
reused in other similar applications.
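

The tripartite model can be pictured as
a set of (actor, concept, instance)
triples, one per tagging action. The
sketch below is an illustration under
assumptions: the sample annotations and
function names are invented, and the
co-occurrence projection is one simple
way to relate concepts, not necessarily
the exact procedure of [14].

```python
from collections import defaultdict
from itertools import combinations

# Each triple records that an actor applied a concept (tag) to an instance.
# The data is invented for illustration.
ANNOTATIONS = [
    ("alice", "sunset", "photo1"),
    ("alice", "beach", "photo1"),
    ("bob", "sunset", "photo2"),
    ("bob", "beach", "photo2"),
    ("carol", "ontology", "paper1"),
    ("carol", "sunset", "photo1"),
]


def concepts_by_instance(annotations):
    """Group the distinct concepts attached to each instance."""
    grouped = defaultdict(set)
    for _actor, concept, instance in annotations:
        grouped[instance].add(concept)
    return grouped


def actors_by_concept(annotations):
    """Group the distinct actors who used each concept."""
    grouped = defaultdict(set)
    for actor, concept, _instance in annotations:
        grouped[concept].add(actor)
    return grouped


def concept_cooccurrence(annotations):
    """Count, for each pair of concepts, the instances tagged with both."""
    weights = defaultdict(int)
    for concepts in concepts_by_instance(annotations).values():
        for a, b in combinations(sorted(concepts), 2):
            weights[(a, b)] += 1
    return dict(weights)


cooccurrence = concept_cooccurrence(ANNOTATIONS)
sunset_users = actors_by_concept(ANNOTATIONS)["sunset"]
```

Keeping the actor in every triple is
what lets the structure follow changes
in the community, since relations can
be recomputed as members and their
tagging habits change.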


Mynard et al. [15] presented a
real-time semantic framework employing
linked open data as a knowledge base for
semantic searches. They used the
GATE-based open-source framework for
searching and aggregation to analyze the
textual content of tweets, highlighting
them in a generic or domain-independent
environment using specific keywords.
They employed this framework in the
domain of political science to study
changes in the political environment,
which led them to successfully predict
the UK general elections. However, its
main focus is not a specific domain;
rather, it presents a more generic form
of analysis. Also, it is limited to
Twitter only.


In [16], the authors used neural
language models to detect hate speech
in comments. They moved away from the
BOW approach because it may miss hate
speech when the offender alters the
offensive words while keeping the
meaning clear. Therefore, they used the
low-dimensional CBOW (Continuous Bag
of Words) approach to make the neural
language model more effective and
efficient in hate speech detection. In
this approach, words are represented in
a common vector space and the model
tries to predict a central word from its
surrounding context.
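

A minimal sketch of the CBOW idea
follows. It shows only how (context,
central word) training pairs are formed
and how one forward pass scores the
vocabulary. The window size, vector
dimension, random initialization, and
function names are assumptions, and
training by gradient descent is
omitted, so this is not the model of
[16].

```python
import math
import random


def context_target_pairs(tokens, window=2):
    """Pair each central word with the words around it inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context, vocab, input_vecs, output_vecs):
    """Average the context embeddings and score every vocabulary word."""
    hidden = average([input_vecs[vocab[w]] for w in context])
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in output_vecs]
    return softmax(scores)


tokens = "the quick brown fox jumps".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
rng = random.Random(0)  # fixed seed so the untrained weights are repeatable
dim = 4
input_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
output_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

pairs = context_target_pairs(tokens, window=1)
probs = cbow_forward(pairs[1][0], vocab, input_vecs, output_vecs)
```

During training, the vectors would be
adjusted so that the probability of the
true central word rises. Words used in
similar contexts then end up close to
each other, which helps when offensive
words are replaced by variants.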


There is another technique known as
sentiment analysis that attempts to
detect the sentiment or emotions of
users from text in order to determine
their attitude or opinion regarding a
certain topic. This may be thought of as
similar to text analysis. However, it is
mostly used to determine whether an
expression or sentence is positive,
negative, or neutral. It is also used to
rate a product or service on the basis
of user reviews. One such approach used
for analyzing social media content
employed unsupervised lexicon-based
classification [17]. The unsupervised
classifier does not require any
training. It produces both a positive
and a negative rating for a given
expression because both positive and
negative emotions may be present in a
text. The approach then makes a ternary
prediction from the two estimates:
whichever emotion outweighs the other
becomes the final prediction. The
authors reported that their approach
produces better results than the
available supervised machine learning
solutions.
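

The ternary decision described above
can be sketched as follows. The two
small lexicons, their strength values,
and the choice to sum strengths are
assumptions made for illustration; the
classifier in [17] uses a much richer
lexicon and additional linguistic
rules.

```python
# Toy emotion lexicons with invented strength values (assumptions).
POSITIVE = {"good": 2, "love": 3, "great": 3}
NEGATIVE = {"bad": 2, "hate": 4, "awful": 3}


def emotion_strengths(text):
    """Estimate positive and negative strength separately for one text."""
    tokens = text.lower().split()
    positive = sum(POSITIVE.get(t, 0) for t in tokens)
    negative = sum(NEGATIVE.get(t, 0) for t in tokens)
    return positive, negative


def ternary_sentiment(text):
    """Return the stronger emotion, or neutral when neither one dominates."""
    positive, negative = emotion_strengths(text)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


labels = [
    ternary_sentiment("i love this great phone"),
    ternary_sentiment("good camera but i hate the awful battery"),
    ternary_sentiment("the phone arrived today"),
]
```

No training data is needed because all
of the knowledge sits in the lexicon,
which is why the approach counts as
unsupervised.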


Another recent empirical study
used semantic feature representations
to better understand the context of
user expression in order to identify the
hate speech intent of the user [18].
Unlike the previous approaches, this
approach used external semantic
knowledge bases in its feature
representations. The semantic features
employed included Hatebase features and
FrameNet features. Hatebase is a
multilingual hate speech knowledge base
available online. The approach averaged
several feature vectors (H_x), covering
hateful meaning, non-hateful meaning,
offensiveness, and unambiguousness, to
generate the knowledge base features of
a given social media post. FrameNet is
another linguistic knowledge base that
provides meanings under different
semantic frame categories. Each post is
processed through a frame-semantic
parser and a vocabulary of frames is
formulated. This vocabulary is then
used to create a vector for each post
representing the count of each frame in
that post. Although the FrameNet
approach has improved hate speech
detection, it is presently limited to
the English language and is partially
affected by words with multiple
connotations.
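

The two kinds of semantic features can
be sketched as follows. Everything in
this example is a stand-in: the lexicon
entries and scores are invented
placeholders rather than Hatebase data,
the word-to-frame table replaces a real
frame-semantic parser, and the frame
names are illustrative only.

```python
# Placeholder lexicon: term -> [hateful meaning, non-hateful meaning,
# offensiveness, unambiguousness]. Terms and scores are invented.
LEXICON = {
    "slur_a": [1.0, 0.0, 0.9, 0.8],
    "slur_b": [0.5, 0.5, 0.5, 0.2],
}
ATTRIBUTES = 4

# Hypothetical stand-in for a frame-semantic parser: word -> frame label.
TOKEN_FRAMES = {
    "kill": "Killing",
    "attack": "Attack",
    "blame": "Judgment",
    "accuse": "Judgment",
}


def knowledge_base_features(tokens):
    """Average the attribute vectors of all lexicon terms found in the post."""
    matched = [LEXICON[t] for t in tokens if t in LEXICON]
    if not matched:
        return [0.0] * ATTRIBUTES
    return [sum(v[k] for v in matched) / len(matched) for k in range(ATTRIBUTES)]


def frame_count_features(tokens, frame_vocabulary):
    """Count how often each frame of the vocabulary is evoked in the post."""
    frames = [TOKEN_FRAMES[t] for t in tokens if t in TOKEN_FRAMES]
    return [frames.count(f) for f in frame_vocabulary]


def post_features(text, frame_vocabulary):
    """Concatenate knowledge base features and frame counts for one post."""
    tokens = text.lower().split()
    return knowledge_base_features(tokens) + frame_count_features(
        tokens, frame_vocabulary
    )


frame_vocabulary = sorted(set(TOKEN_FRAMES.values()))
features = post_features(
    "they blame and accuse every slur_a and slur_b", frame_vocabulary
)
```

The resulting vector can be passed to
any classifier. A word with several
connotations maps to a single entry
here, which mirrors the limitation
noted above.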


The state of the art of the various
data mining techniques is presented
below in Table 1 and Table 2. We
divided all techniques previously used
for the analysis of text in social
media into three categories, namely
text mining, sentiment analysis, and
semantics. Certain parameters, including
main idea, search, discovery,
prediction, and methodology validation,
were chosen to compare the different
techniques.


TABLE 1
CATEGORIZATION OF TECHNIQUES USED FOR TEXT ANALYSIS IN SOCIAL MEDIA

Author            Year  Technique                              Category            Main Idea                          Social Network
Rahman            2012  Systematic Text Mining Model           Text Mining         User Attributes                    Facebook
Dyuric et al.     2015  Continuous Bag of Words (CBOW)         Text Mining         Low Dimensional Text Embeddings    Generic
Sarker et al.     2016  Automated Supervised Classification    Text Mining         Combining Classifiers              Twitter
Paltoglou et al.  2012  Unsupervised Lexicon Classification    Sentiment Analysis  Linguistics Functions Classifier   Twitter, MySpace, Digg
Mika              2007  Folksonomy, tripartite semantic model  Semantics           Social Tagging                     Generic
Mynard et al.     2017  GATE semantic framework                Semantics           Keyword Search Declarative         Twitter
Senarath et al.   2020  Semantic features                      Semantics           Knowledge Based Semantic Features  Twitter


Table 1 shows the categories in 
which these works fall and also depicts
the platform on which each work was
carried out. The technique and main
idea columns show the adopted approach.


Table 2 compares the techniques with
respect to the parameters we have
specified. Search corresponds to the
searching of words and phrases in
social media, and discovery corresponds
to the identification of the true
intended meaning of the text as hate
speech. Another parameter is
prediction, which refers to predicting
such actions in the future from the
given text, or predicting other content
or intentions of the user by analyzing
their content. Lastly, the last column
of the table indicates the validation
status of the technique.


TABLE 2
COMPARISON OF TECHNIQUES W.R.T SELECTED PARAMETERS

Technique                              Category            Search               Discovery            Prediction           Validation
Systematic Text Mining Model           Text Mining         To a certain extent  To a certain extent  Yes                  No
Continuous Bag of Words (CBOW)         Text Mining         Yes                  To a certain extent  Yes                  Yes
Automated Supervised Classification    Text Mining         Yes                  Yes                  Yes                  No
Unsupervised Lexicon Classification    Sentiment Analysis  Yes                  To a certain extent  To a certain extent  Yes
Folksonomy, tripartite semantic model  Semantics           Yes                  Yes                  No                   Yes
GATE semantic framework                Semantics           Yes                  To a certain extent  Yes                  Yes
Semantic features                      Semantics           Yes                  Yes                  No                   Yes


These studies provide valuable
insights into the approaches and
strategies used in the past for natural
text analysis of social media content.
The works mentioned here also
correspond to approaches and
techniques adopted by several other
researchers. It has been noted in the 
current study that most approaches use 
some keywords related to a specific 
domain, for instance hate speech in this 
scenario, as a targeted search across 
social media posts. This approach is 
mostly observed in the text mining 
technique. Some researchers [12] have 
adopted the approach of profiling the 
activities of individual users to detect 
hate speech. Others [16] tried to model 
the captured data through graphs and 
other representations, so as to apply 
mathematical operations to detect or
predict hate speech using partial
content. Another interesting but 
complex approach is to combine the 
results of multiple classifiers to give a 
final one as an automated yet supervised 
classification. The idea of detecting 
sentiments and emotions as in [17] can 
be very useful in hate speech detection. 
However, what if the attacker uses 
sarcastic language that may not contain 
abuse related words but still attacks 
some group or individual using 
sarcasm? This problem can only be 
dealt with if we analyze the relationship 
of words and sentences as in ontologies. 
Ontologies have also been used as 
semantic approaches for the analysis of 
textual content. 


In the field of semantics, most 
approaches by different researchers use 
linked open data for identifying hate 
speech and analyzing the text from 
social media as in [15], [19]. Although
this approach is partially helpful, such
data can only provide a structured
knowledge base, which might not be
successful in most cases where the data
obtained from posts and comments is
unstructured yet incorporates hate
speech and abusive or slang language.
An ontology can be
built that may represent these 
unstructured terminologies in a 
structured and semantic manner as a 
knowledge base that can reuse 
structured concepts from linked open 
data as well. Currently, there is no 
generic ontology on hate speech.


III. CONCLUSION 


Hate speech on social media has 
been the cause of several catastrophes 
and social anarchy in the past. The
impact of social media and our reliance
on it have increased, and so have the
chances of hate speech. Textual
surveillance by governments is
therefore necessary to detect and
prevent hate crimes. Most previous
approaches in data mining do not work
well with the ambiguous and
unstructured data emerging from social
media platforms. Several techniques
have been used to solve this problem. 
We have presented in this paper a 
review of the various types of 
techniques used in the effort to detect 
and analyze textual content on social 
media platforms. These techniques can 
be divided into text mining, sentiment 
analysis, and semantics. Each has its 
own advantages and shortcomings. We 
argued that analyzing responses or 
comments can provide insights into the 
presence and severity of hate-related 
social media posts. The use of semantic 
techniques has proven to be more 
efficient. However, there is still a need 
to explore the features of semantic web 
technologies for the detection of hate 
speech in social media, in particular to 
cope with the problems of multiple 
interpretations and sarcasm. The state
of the art presented in this paper can
help researchers identify problems and
present better solutions in the future.